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Abstract 

In this article we propose a new method, which we name 'quartet 
neighbor joining', or 'quartet-NJ', to infer an unrooted species tree on 
a given set of taxa T from empirical distributions of unrooted quartet 
gene trees on all four-taxon subsets of T. In particular, quartet-NJ 
can be used to estimate a species tree on T from distributions of gene 
trees on T. The quartet-NJ algorithm is conceptually very similar 
to classical neighbor joining, and its statistical consistency under the 
multispecies coalescent model is proven by a variant of the classical 
'cherry picking'-theorem. In order to demonstrate the suitability of 
quartet-NJ, coalescent processes on two different species trees (on five 
resp. nine taxa) were simulated, and quartet-NJ was applied to the 
simulated gene tree distributions. Further, quartet-NJ was applied to 
quartet distributions obtained from multiple sequence alignments of 28 
proteins of nine prokaryotes. 
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1 Introduction 

As the amount of available sequence data from genomes or proteoms rapidly 
grows, one is confronted with the problem that, for a given set of taxa, tree 
topologies relating homologous sequences of these taxa can (and in prac- 
tice often will) differ from the each other, depending on which locus in the 
genome or which protein is used for constructing the tree. Possible reasons 
for discordance among gene trees and the species phylogeny include hor- 
izontal gene transfer, gene duplication/loss, or incomplete lineage sorting 
(deep coalescences). Thus one is confronted with the task to determine the 
phylogeny of the taxa (the 'species tree') from a set of possibly discordant 
gene trees (Maddison [8], and Maddison and Knowles [9]). 



Several methods have been proposed and are used to infer a species tree 
from a set of possibly discordant gene trees: 

(1) Declaring the most frequently observed gene tree to be the true 
species tree was shown to be statistically inconsistent under the multi- 
species coalescent model by Degnan and Rosenberg [4], in cases where branch 
lengths of the underlying species tree are sufficiently small. 

(2) A popular approach for species tree reconstruction is by concatenat- 
ing multiple alignments from several loci to one large multiple alignment 
and construct a single 'gene tree' from this multiple alignment by standard 
methods. This approach was pursued e.g. by Teichmann and Mitchison [T7] 
and Cicarelli et al. [2]. However, also this method was shown to be inconsis- 
tent under the multispecies coalescent by Degnan and Kubatko [5] , if branch 
lengths on the species tree are sufficiently short. 
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(3) Similarly, the concept of minimizing the number of 'deep coales- 
cences' (Maddison [8]) was recently shown to be statistically inconsistent 
under the multispecies coalescent model by Than and Rosenberg [19] , 

(4) On the other hand, it is well known from coalescent theory that 
the probability distribution of gene trees, or even the distributions rooted 
gene triplet trees, on a species tree uniquely determine the species tree if 
incomplete lineage sorting is assumed to be the only source of gene tree 
discordance. Similarly (see Allman et al. PQ), the distributions of unrooted 
quartet gene trees identify the unrooted species tree topology. However, as 
soon as triplet/quartet gene trees are inferred from experimental data, their 
distributions will differ from the theoretical ones, and this may lead to con- 
flicts among the inferred triplets/quartets on the hypothetical species tree, 
a problem which is not straight forward to resolve ('quartet-puzzeling' could 
be an approach to resolve this, see Strimmer and von Haeseler |16j). Also di- 
rect maximum likelihood calculations using gene tree distributions are very 
problematic due to the enormous number of possible gene trees on a given 
set of taxa. 

(5) Recently Liu and Yu [7] have proposed to use the 'average internode 
distance' on gene trees as a measure of distance between taxa on species trees 
and apply classical neighbor joining to these distance data. They prove that 
this reconstructs the underlying species tree in a statistically consistent way. 

In the present paper we propose a method to overcome many of the 
difficulties discussed above, and in particular to make the above mentioned 
result by Allman et al. [1] accessible in practice. That is, we describe a 
polynomial time algorithm (in the number of taxa), which uses empirical 
distributions of unrooted quartet gene trees as an input, to estimate the 
unrooted topology of the underlying species tree in a statistically consistent 
way. Due to the conceptual similarity to classical neighbor joining we call 
this algorithm 'quartet neighbor joining', or briefly 'quartet-NJ'. 

The fact that quartet-NJ uses quartet distributions as an input makes it 
flexible in practice: Quartet gene tree distributions can be obtained directly 
from sequence data (as is described in the application to a prokaryote data 
set in Section [5] of this paper), or e.g. from gene tree distributions, which 
is the type of input required by Liu and Yu's 'neighbor joining for species 
trees' [Zj. Also, quartet-NJ is naturally capable of dealing with multiple 
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lineages per locus and taxon. 

The paper is organized as follows: After a brief review of the multispecies 
coalescent model and distributions of quartet gene trees in Section [2 we 
investigate in Section [3] how to identify cherries on a species tree. To this 
end we show how to assign 'weights' to unrooted quartet trees on the set 
of taxa, using the distributions of quartet gene trees, and how to define a 
'depth' for each pair of taxa. In analogy to classical cherry picking we prove 
that under the multispecies coalescent model any pair of taxa with maximal 
depth is a cherry on the species tree. Moreover, we give an interpretation of 
this theorem by the concept of 'minimal probability for incomplete lineage 
sorting', in analogy to the concept of minimal evolution underlying classical 
neighbor joining. 

In Section H] we translate our cherry picking theorem into a neighbor 
joining-like procedure ('quartet-NJ') and prove that it reproduces the true 
unrooted species tree topology as the input quartet distributions tend to the 
theoretical quartet distributions. In other words, quartet-NJ is statistically 
consistent. 

Finally, in Section [5] we apply the quartet-NJ algorithm to data from 
coalescence simulations, as well as to a set of multiple alignments from nine 
prokaryotes, in order to demonstrate the suitability of quartet-NJ. In both 
situations we consider only one lineage per locus and taxon. 

2 The multispecies coalescent model 

We are going to present briefly the multispecies coalescent, which models 
gene (or protein) tree evolution within a fixed species tree. The material 
in this section is neither new nor very deep. Rather this section is to be 
considered as a reminder for the reader on the relevant definitions, as well 
a suitable place to fix language and notation for the rest of the paper. For 
a more detailed introduction to the multispecies coalescent model we refer 
e.g. to Allman et al. [1]. 

2.1 Overview 

Let us consider a set T of n taxa (e.g. species) and a rooted phylogenetic 
(that is: metric with strictly positive edge lengths and each internal node is 
trivalent) tree S on the taxon set T. The tree S is assumed to depict the 
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'true' evolutionary relationship among the taxa in T, and the lengths of its 
internal edges are measured in coalescence units (see below). This tree is 
commonly called the species tree on the taxon set T. 

It is a fundamental problem in phylogenetics to determine the topology 
of this tree. Molecular methodes, however, usually produce evolutionary 
trees for single loci within the genome of the relevant species, and these 
gene trees will usually differ topologically from the species tree (Degnan 
and Rosenberg [4]). This has several biological reasons, like horizontal gene 
transfer, gene duplication/loss or incomplete lineage sorting. The latter 
describes the phenomenon that to gene lineages diverge long before the 
population actually splits into two separate species, and in particular gene 
lineages may separate in a different order than the species do. 

The multispecies coalescent model describes the probability for each 
rooted tree topology to occur as the topology of a gene tree, under the 
assumption that incomplete lineage sorting is the only reason for gene tree 
discordance. 

Notation: We adopt the common practice to denote taxa (species) by 
lower case letters, while we denote genes (or more general: loci) in their 
genome by capital letters. E.g. if we consider three taxa a,b,c G T, the 
letters A,B,C will denote a particular locus sampled from the respeceive 
taxa. (By abuse of notation, we will identify the leaf sets of both species 
and gene trees with T). 

Considering only a single internal branch of the species tree, and two 
gene lineages within this branch going backwards in time, we find that the 
probability that the two lineages coalesce to a single lineage within time r 
is given by 

P = 1 - exp(-r) 

where r is in coalescence units (that is number of generations represented 
by the internal branch divided by the total number of allele of the locus 
of interest, present in the population). In particular, if the branch on the 
species tree has length d, the probability that two lineages coalesce within 
this branch is 1 — exp(— d). 

From this, Nei [12] derives the following probabilities under the multi- 
species coalescent model for a three taxon tree. Denote the taxa by a, 6, c, 
the corresponding loci by A, B, C, and assume that the species tree has the 
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topology ((a, b),c) with internal edge length d > 0. Then the probability 
that a gene tree sampled from these taxa has the topology ((A,B),C) is 
equal to 

2 

1 - g exp(-d), 

while the other two topologies are observed with probability |exp(— d). 

In [3] Degnan and Salter show how to calculate the probabilities of a 
gene tree in a fixed species tree on an arbitrary taxon set. It turns out 
that these probabilities are polynomials in the unknonws exp(— tig), where 
dE denotes the length of the edge E on the species tree. In particular, the 
multispecies coalescent model is an algebraic statistical model. 

2.2 Quartet distributions 

In the following we consider four taxon species trees on the taxon set T = 
{a,b,c,d}, and we use Newick notation to specify (rooted) species trees. 
For instance Si = (((a, 6) : x,c) : y,d) denotes the caterpillar tree with 
edge lengths x and y in coalescent units. Moreover, unrooted quartets are 
denoted in the format (AB, CD), meaning the quartet gene tree which has 
cherries (AB) and (CD). 

Up to permutation of leaf labels there is only one additional tree topology 
on four taxa, namely the balanced tree 52 = ((a, b) : x, (c, d) : y). For both of 
these two species tree topologies it is not hard to calculate the probabilities 
of the tree possible unrooted quartet gene trees on T, and one obtains 

Lemma 1 (AUman et al. pQ). For both species tree topologies S = S\ and 
S = S2 the probabilities of unrooted quartet gene trees on T have the same 
form and are given by the formulas 

P(AB,CD) = l-^M-x-y)>\, (2.1) 
P(AC,BD) = ±exp(-x-y)<±, (2.2) 

P(AD,BC) = ^e W (-x-y)<\- (2-3) 

In particular, the gene quartet distribution determines the unrooted 
species tree topology, but not the position of the root (Allman et al.). The 
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sum of the internal edge lengths of the rooted species tree (resp. the length 
of the internal edge of the unrooted species tree) is given by the formula 

d(ab, cd):=d = x + y = - log(|(l - P(AB, CD))). (2.4) 

Returning to the study of the multispecies coalescent on species trees on 
arbitrary taxon sets, we deduce from the above that the quartets displayed 
on the species tree S on the taxon set T are exactly those which appear with 
a probability bigger than | for any sampled locus. Hence, the distributions 
of quartets determine uniquely the quartet subtrees which are displayed by 
the true species tree S, and hence determine the unrooted topology of S 
(see Allman et al. [I]). 

In the following two sections we describe a neighbor joining algorithm 
which makes this theoretical insight applicable in practice, meaning that 
it yields a method to estimate unrooted species trees from (empirical) dis- 
tributions of quartet gene trees, which is statistically consistent under the 
multispecies coalescent model. Statistical consistency of this algorithm will 
follow from the 'cherry picking theorem' below. 

3 A cherry picking theorem 

Here we give a precise criterion, using theoretical distributions of quartet 
gene trees under the multispecies coalescent, to determine which pairs of 
taxa on the species tree are cherries. This criterion can as well be applied to 
estimated quartet distributions and thus enables us to recursively construct 
a species tree estimate from observed gene quartet frequencies in Section HI 

3.1 Depth of a pair and cherry picking 

As always, we consider a fixed species tree S on the taxon set T, and we 
denote by P(IJ,KL) := P$(IJ,KL) the probability that a gene tree on T 
displays the gene quartet tree (IJ,KL). 

Recall that by lower case letters i, j,l,k we denote the species from which 
the genes /, J, K, L are sampled. Regardless of whether S contains the 
(species-) quartet (ij,kl), we can attach to it the following numbers: 

Definition 2 (Weight of a quartet). The weight of the quartet (ij,kl) is 



7 



defined as 

w(ij,kl) = -log(^(P(IK,JL) + P(IL,JK))). (3.1) 

If the taxa i,j, k, I are not pairwise distinct, then we set w(ij, kl) = 0. 
Of course, if i,j, k, I are pairwise distinct we may equivalently write 

w(ij, kl) = -\og(^(l-P(IJ,KL))). 

Lemma 3. If the quartet (ij, kl) is displayed on the species tree S, then the 
weight w(ij, kl) is precisely the length of the interior branch of the quartet 
(ij,kl). Moreover, the other two weights, w(ik,jl) and w(il,jk), are less 
than zero. 

Proof. The proof is immediate using equation l[2.4j) : If the quartet tree 
(ij, kl) is displayed on the species tree S, then 

w(ij, kl) = - \og(^(P(IK, JL) + P(IL, JK))) = 

= - log(|| exp(-d(ij, kl))) = d(ij, kl). (3.2) 

Otherwise, if (ij, kl) is not displayed on S, then one of the other two quartet 
trees on {i,j,k,l} is displayed, say (ik,jl), and we have 

w(ij, kl) = - \og(^(P(IK, JL) + P(IL, JK))) = 

= -log(|(l -±exp(-d(ik,jl)))) <-Iog(||) = 0, (3.3) 

since d(ik,jl) > 0. □ 

This little lemma motivates the following definition and a cherry picking 
theorem which is formulated and proved in close analogy to the 'classical' one 
(see Saito and Nei |13j for the original publication, or Pachter and Sturmfels 
|14j for a presentation analogous to ours). 

Definition 4 (Depth of a pair). For any pair of taxa i,j E T we define the 
depth of(i,j) to be the number 

f(i,j)= w(ij,kl). (3.4) 

k,leT 
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(Recall that if {i,j} n {k, 1} ^ £/ien w(ij, kl) = 0.) 

Theorem 5 (Cherry Picking). If a pair of taxa i,j £ T has maximal depth 
f(i,j), then (ij) is a cherry on the species tree S. 

As mentioned above, the proof of this theorem is parallel to the proof of 
the classical cherry picking result as presented in [14]. We state it here for 
sake of completeness of our exposition. 

Proof. As a first step, we introduce the auxiliary values 

_ J w(ij, kl) if (ij, kl) is displayed on S, 
v[ij, Kl) :— < (o.o) 
I 1) else, 

for every four taxon subset k, 1} C T, as well as 

d(hj) ■= ^2v(ij,kl) (3.6) 

k,l 

for every pair of taxa. Obviously, if (ij) is a cherry on S, then w(ij, kl) = 
v(ij, kl) for all k,l £ T, and hence also g(i,j) = f(i,j)- In general, for any 
pair of taxa (i,j) one has g(i,j) > f(i,j) by Lemma O 

We will now assume that the pair (i,j) does not form a cherry on the 
species tree S and prove that in this case there exists a cherry (pq) on S 
such that the following inequalities hold: 

f(p,q) = g(p,q) > g(hj) > f(hj)- (3-7) 

From this we deduce the claim of the theorem: If (ij) is not a cherry on S, 
then it does not have maximal depth. 

Proof of the claim. We have to find a cherry (pq) on S such that indeed 
9(P}Q) > dihj) holds. As we assume i and j not to form a cherry on 
S, the unique path connecting i and j crosses at least two interior nodes 
(symbolized as black dots in figure [TJ We denote the number of internal 
nodes on this path by r > 2. To each s = 1, . . . , r a rooted binary tree Tj 
is attached (see Figure [TJ. By interchanging i and j, if necessary, we may 
assume that the number of taxa in T\ is less or equal than the number of 
taxa in T r . 
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Figure 1: The nodes i and j do not form a cherry. 




<D 



Ti 




First we consider the case that the subtree T\ consists of a single leaf il . 
In this case the pair (i, i') is a cherry on S, and it is easy to see that indeed 
g(i,i') > g{i,j). 

Second, if T\ has more than one leaf, let choose a cherry (pq) in T\ (every 
rooted binary tree with more than one leaf has at least one cherry!). Now we 
decompose the difference g(p,q) — g(i,j) into six sums which will be treated 
separately: 

g( P ,q)-g(i,j) = S 1 + ... + S 5 , (3.8) 

where 



Si= (v(pq,kl) -v(ij,kl)), 

k,leT m ,m=2,...,r 

S 2 = ^2 (v{pq,kl) -v(ij,kl)), 

kET m \{p,q},l&T m i\{p,q},m,m'=l,...,r,m^m' 

S 3 = ^2 (v(pq,kl) -v(ij,kl)), 
fc,zeTi\{ P , g } 

S4 = ^2 (v(pq, hi) + v(pq, kj) - v(ij, kq) - v(ij, hp)), 

k£T m ,m=2,...,r 

S 5 = ^2 ( V (P1, hi) + v(pq, kj) - v(ij, kq) - v(ij, kp)) + 
+ (-v(pq,ij) +v(ij,pq)). 

Inspection of the tree in Figure Q] immediately shows that each summand 
in S\ is positive, and more precisely, greater or equal to d(ij,pq). Hence 
Si > (^ T 2^)d(ij,pq). In S2, the expressions v(ij,kl) all vanish, hence this 
sum is positive. The third sum S3 might be negative - but not too negative: 
The situation is illustrated by Figure [2] (which arises from the tree in Figure 
[T]by deleting the irrelevant subtrees T2, . . . , T r ), whose inspection shows that 
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each for each summand v(pq, kl) — v(ij, kl) we have 

v(pq, kl) — v(ij, kl) = d(pq, kl) — d(ij, kl) > —d(ij,pq). 

Thus we see S 3 > -i} Tl )^ 2 )d(ij,pq) > -(^)d(ij,pq), whence Si + S 2 + S3 
is positive. 

Now we treat the frouth sum: If A; is a node in one of the subtrees 
T2, . . . ,T r , then v(ij,kq) = v(ij,kp) = 0, so S4 is non-negative. More pre- 
cisely we have S4 > 2('^ r ')i> (pq, ij). On the other hand, the sum S5 is 
greater or equal — 2(\ Tl )~ 2 )v{ij ,pq) > —2{j T ^)v(pq,ij), whence also S4 + S5 
is positive. In total we have found that in any case g(p,q) > g{i,j), which 
proves our claim. □ 

As explained above, this suffices to prove the cherry picking theorem. □ 

3.2 Statistical consistency of cherry picking 

In practical applications one will not know the precise probability of each 
quartet gene tree (AB,CD). Hence one has to use e.g. relative frequencies 
as estimates. Let us denote by r(AB, CD) the (experimentally determined) 
relative frequency of the gene quartet (AB, CD). In analogy to Definitions 
[2] and H] we make the following 

Definition 6. The empirical weight of the quartet (ij, kl) of taxa is defined 
as 

w(ij, kl) = - log(^(r(IK, JL) + r(IL, JK))), (3.9) 
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Moreover, the empirical depth of a pair of taxa is defined as 

k,leT 

Under the multispecies coalescent model, the relative frequency r(IJ, KL) 
of a quartet gene tree (IJ, KL) is a statistically consistent estimator for its 
probability P(IJ, KL). Hence the empirical depth f(i,j) of a pair of taxa 
i,j is a statistically consistent estimator for f(i,j). In particular this means: 
inferring a pair (i,j) of taxa, where f is maximal, as a cherry on the species 
tree is statistically consistent. More precisely, we have 

Corollary 7. Let N be the number of loci sampled from each taxon in T and 
denote by Cn the set of pairs of taxa at which f attains its maximum. If the 
number of genes N tends to infinity, then the probability that Cjv contains 
a pair which is not a cherry on the true species tree approaches zero. 

Proof. As there are only finitely many taxa in T, there is a strictly positive 
difference between the maximum value of / on T 2 and its second biggest 
value. Since moreover / and / are continuous, there exists an 5 > such 
that whenever \r(IJ, KL) — P(IJ, KL)\ < 5, f(p,q) is maximal if and only 
if f(p,q) is maximal, and the claim follows from Theorem [SJ But the prob- 
ability that \r(IJ, KL) — P{IJ,KL)\ < S approaches 1 as N grows, for any 
choice of 5. □ 

3.3 Minimal probability for incomplete lineage sorting 

Recall that classical neighbor joining is a greedy algorithm which in each 
'cherry picking' step declares a pair of taxa to be neighbors if this minimizes 
the sum of branch lengths in the refined tree resulting from this step (Saito 
and Nei [13]). In other words, classical cherry picking and neighbor joining 
is guided by the principle of minimal evolution. It turns out that our cherry 
picking result in Theorem [5] can be interpreted in a similar fashion. 
Let us make the following ad-hoc definition. 

Definition 8. Let X±, . . . ,X n be independent random variables with values 
in {0, 1}, and let pi be the probability of {X. L = 1} for each i = 1, . . . , n. We 
call the geometric mean 

P= tyPi--- Pn (3.11) 
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the average probability of the random experiments X\ , . . . , X n giving the 
result 1. 

Let us now consider what the cherry picking Theorem [5] does with a set 
of taxa T, initially arranged as a star-like tree (see Figure [3]). Consider two 
taxa i / j G T hxed and let q(ij,kl) = P(IK,JL) + P(IL,JK). On a 
species tree on T which displays the cherry (ij), this is the probability that 
a gene quartet tree sampled from the taxa i,j,k,l differs from the species 
quartet tree (ij,kl). Let 



be the average probability of discordance of a sampled gene quartet tree 
with the corresponding quartet on the species tree, for a set of four taxa 
containing % and j. We will also call this number the 'average probability of 
incomplete lineage sorting' for quartets containing the taxa i and j. With 
this terminology, the cherry picking Theorem [5] can be phrased as follows. 

Theorem 9. A pair of taxa i,j £ T is a cherry on the true species tree 
S if it minimizes the average probability for incomplete lineage sorting for 
quartets containing the taxa i and j. 

Proof. Using Theorem [5] we only have to show that q(i,j) is minimal if and 
only if f(i,j) is maximal. But this follows from the fact that 



4 Quartet neighbor joining algorithms 

In this section we discuss how the above cherry picking result can be used 
to design an algorithm which estimates a species tree from (observed) gene 
quartet tree frequencies. 




logq(i,j) + C, 



where C is a constant independent of i,j. 



□ 
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Figure 3: One cherry picking step. 




4.1 A naive neighbor joining algorithm 

The first version of our neighbor joining algorithm reconstructs the under- 
lying species tree from the (theoretical) gene quartet tree distributions. 

Algorithm 10. Input: A set of taxa T = {t±, . . . ,t n } containing at least 
four elements, and for each quartet Q = (ab, cd) with a,b,c,d £ T the 
probability r(Q) := Ps(Q) of the quartet Q under the multispecies coalescent 
model on the species tree S. 
Output: The species tree S. 

Step 0. Set S to be the graph with vertex set T and with empty edge 

set. 

Step 1. If \T\ < 3 go to step 4. 

Step 2. For each unordered pair {i,j} C T calculate the depth f(i,j). 
Let i, j £ T be a pair of taxa with maximal depth. Add a node x to S and 
draw edges from i to x and from j to x. Replace T by T U {x} \ {i,j}. 

Step 3. For each quartet (xa, be) with a,b,c £ T \ {x} set 

r'(xa,bc) = -(r(ia,bc) + r(ja,bc)), 

and for all quartets (ab, cd) which do not contain x set r'(ab, cd) = r(ab, cd). 
Replace r by r' . Go to step 1. 

Step J,.. If\T\ =3, then add a vertex to S and draw edges from this new 
vertex to the three elements ofTc vertices(S). □ 

Remark 11. Each calculation of f(i,j) requires 0(\T\ 2 ) logarithms and 
additions. Repeating this for every pair i,j of taxa gives a computational 
complexity ofO(\T\ A ) for step 2. This is also the complexity of one iteration 
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of the algorithm. Since each step will identify exactly one cherry, we need 
0(\T\ 5 ) iterations to reconstruct S, which gives a total complexity ofO(\T\ 5 ) 
for Algorithm \1 (A It is rather straightforward to improve step 2 so that the 
algorithm requires only 0(|T| 4 ) arithmetic operations (see Subsection \4-3\ ). 

Using the result of Theorem^ we may summarize: With Algorithmic we 
have described a greedy algorithm which inferes in each step the cherry which 
requires minimal incomplete lineage sorting. Further, the fact that cherry 
picking is statistically consistent implies that also species tree estimation by 
Algorithm 1101 is statistically consistent: 

Corollary 12 (Statistical consistency of Algorithm 1 10p . Let N be the num- 
ber of loci sampled from each taxon in T and denote by Sjy the estimate 
for the species tree produced by Algorithm \1(A If the number of genes N 
goes to infinity, then the probability that Sjf equals the true species tree S 
approaches 1. 

Proof. This follows by a repeated application of Corollary [7J □ 

In practical applications two problems may occur using Algorithm [10J 
First, there might be several non-disjoint pairs of taxa with maximal depth. 
This could be resolved by chosing one of these pairs randomly. A more seri- 
ous problem occurs when many of the empirical gene quartet distributions 
are 0. This will be discussed in the following subsection. 

4.2 Perturbing weights and quartet-NJ 

Algorithm 1101 works well if one uses as input the theoretical probabilities for 
quartet gene trees. It also works well, if one uses relative quartet frequencies 
which are 'very close' to the probabilities P(IJ, KL) (and in particular non- 
zero). In other cases, the result might be problematic, as we are going to 
explain now. 

Imagine that for some reason (e.g. very long branch lenghts on the species 
tree) the observed quartet gene trees reflect very precisely the topology of 
the species tree. This might mean in the extreme case that for every four- 
taxon subset {I, J, K, L] one observes only this very quartet gene tree which 
is displayed also by the (unknown) species tree. In some sense this situation 
should be optimal, since the observed gene quartet trees fit together without 
conflict and thus yield an unambiguous estimate for the species tree. 
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However, let us run Algorithm [10] with the empirical depth function / 
in place of / and consider any pair i,j of taxa. Regardless of the choice 
of i and j, there exist k,l S T such that (ij,kl) is a quartet on the species 
tree, and hence by our assumption r(IJ, KL) = 1. Calculating the empirical 
depth of yields 

f(i,j)= E -log(^(l-r(/J,KL)) = +oo. (4.1) 
k,ier 

Thus every pair (i, j) maximizes /, and so in each iteration of Algorithm [TU1 
the choice of a cherry is completely arbitrary. So this procedure fails in a 
situation, which has to be considered the 'easiest' possible in some obvious 
sense. 

A possible solution to this problem is to perturb the arguments in the 
logarithms in equation (|4.ip by a small number e > 0. In other words, we 
fix < e < 1 and calculate, for each pair of taxa i, j, the 'perturbed depths' 

f e (i,j)= £ -\ogi^{(l + e)-P(IJ,KL)) <oo, (4.2) 

k,leT 

f £ (i,j)= £ - log(^((l + e) - r(/J, KL)) < oo. (4.3) 

Obviously, for e -> we have f e (ij) -)• /o(*',j) = and / e (i,j) -4 

fo(hj) = /(*>j)) f° r an pairs G T 2 . From this we obtain the following 
easy 

Lemma 13. Assume that the theoretical probabilities P(U, KL) are known 
for each four taxon subset {I, J, K, L} C T and are used as input for Al- 
gorithm \10l Then there exists a constant c > such that Algorithm [73 
computes the same results with f e in place of f , for each < e < c. In other 
words, if S t is the species tree inferred by Algorithm \10\ with f e in place of 
f, then the limit lim5 e exists and is equal to S = Sq. 

<E-*-0 

Proof. This follows from the fact that S is a binary tree, that there are only 
finitely many values of / and that f e (i,j) depends continuously on e. □ 

This suggests that for practical applications it will be reasonable to fix 
e > 'small enough', and run Algorithm 1101 with the perturbed empirical 
depth function f e in place of / in order to avoid 'infinite depths' as in the 
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discussion at the beginning of this subsection. Hence we obtain the following 
modification of Algorithm [TUl 

Algorithm 14 (Quartet Neighbor Joining, Quartet-NJ). Input: A finite 
set of taxaT = {t\, . . . ,t n } , a ' small' constant e > , and for each four taxon 
subset {a,b,c,d} C T the relative quartet gene tree frequencies r(AB,CD), 
r(AC, BD), r(AD, BC). 

Output: An unrooted tree S e on T which is our estimate for the topology 
of the species tree S on T. 

Algorithm: Run Algorithm\TQwith the depth function f(i,j) substituted 
by the empirical depth f t (i,j). The result of this calculation is S e . □ 

Indeed, if we apply this modified algorithm to the problematic situation 
at the beginning of this subsection, we will obtain (for any choice of e) a fully 
resolved correct estimate for the species tree S. Of course, this algorithm 
has the same complexity as Algorithm 1101 

We want to claim that, for sufficiently small values of e, quartet-NJ 
produces a asymptotically correct estimate of the true species tree on T. To 
this end, we first prove 

Lemma 15. There exists a constant c > such that for all < e < c the 
trees S e are equal. In other words, the limit ]imS e exists. 

e->-0 

Proof. Since there are only finitely many cherries to pick, we may restrict 
our considerations to the picking of the first cherry. We have to distinguish 
two cases: First assume that there are taxa i,j such that f(i,j) = oo, i.e. 
the set Inf = {(i,j)\f(i,j) = oo} is nonempty. If e is small enough, then 
the maximum of the values of f e is attained at one of the pairs in Inf. At 
which of those pairs the maximum is attained then only depends on the 
number of summands in f t (i,j) which approach infinity as e tends to zero. 
This number is clearly independent of e, whence the set of potential cherries 
to pick is independent of e. 

In the second case, there are no elements in the set Inf. This means 
that each of the perturbed empirical depths f e (i,j) approaches the finite 
value f(i,j) as e tends to zero. This again means that, for e small enough, 
the pair which maximizes f e does not depend on e. □ 

Combining this with Lemma [131 we obtain the desired result. 
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Theorem 16 (Statistical consistency of quartet-NJ for small e). The limit 
lim(S' e ) exists and is a statistically consistent estimate for the true species 

e->0 

tree S = Sq. In particular, quartet-NJ ist statistically consistent. 

Proof. The existence of the limit in the theorem is established by Lemma [T5l 
It remains to prove that it is a statistically consistent estimate for S. Recall 
that the definition of the function f e (i,j) depends on the relative frequencies 
r(IJ, KL). If we consider the probabilities P(IJ, KL) known and fixed and 
moreover fix the pair of taxa then we may consider the function 

F(i,j) = fe(i,j) - f e (i,j) : R>o x R -> R, (4.4) 

depending on the 'variables' r(IJ,KL) and e. This is a continuous function 
which vanishes on the line {(P(IJ, KL)) {IJ)KL) )} xRc rCD x R. Thus for 
every 5 there exists an open ball centered at ((P(IJ, KL))( IJKL j, 0) which 
is mapped to (—5,5) C M by F(i,j). In particular, there exists a positive 
constant c > such that for every (i,j) we have \f t (i,j) — f e (i,j)\ < 5 if 
\r(IJ, KL) — P(IJ, KL)\ < c and e < c. If we chose 5 small enough, we thus 
may conclude that S e = S e for every e < c and for all relative frequencies 
satisfying \r(IJ,KL) — P(IJ, KL)\ < c. Moreover, by Lemma [131 we may 
assume, by reducing c if necessary, that S e = S for every e < c. 

Taking this together we obtain that, provided \r(IJ, KL)—P(IJ, KL)\ < 
c for every quartet (ij,kl), limS t = S. Since r(IJ,KL) is a statistically 

consistent estimate for P(IJ,KL), this proves that limS",; is a statistically 
consistent estimate for S. □ 

The question of how to find an appropriate e > to run the neighbor 
joining algorithm will of course depend on the special problem instance and 
is left open here. 

4.3 Reduction of computational complexity 

Here we briefly mention a complexity reduction for the quartet neighbor 
joining algorithm. Since Algorithm 1101 and [T41 are completely equivalent in 
this respect, we formulate this reduction only with the simpler notation of 
Algorithm [TO]. 

We consider step 2 in Algorithm [TOl In our present formulation, this 
step calculates the depth f(i,j) by adding up 0(|T| 2 )-many quartet weights 
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for each pair of taxa However, most of these quartet weights will not 
have changed during the last iteration of the algorithm. 

Assume that in the previous iteration the cherry (iojo) was identified 
and joined by a new vertex x. Then for any pair of taxa i,j which are still 
present in T we may calculate the new depth fnewihj) from the old one by 
the formula 

fnew(hj) = f(i,j) ~ ^2{w{ij,iok) +w(ij,j k) + ^ j w new (ij,kx). 
keT keT 

This reduces the complexity of step 2 from 0(|T| 4 ) to 0(|T| 3 ), and hence the 
overall complexity of Algorithms [10] and are reduced by this modification 
to 0(|T| 4 ). 

5 Application and simulations 

5.1 Application to a prokaryote data set 

In order to test quartet neighbor joining on real data, Algorithm [TH was 
applied to a set of protein sequences from nine prokaryotes, among them 
the two archaea Archaeoglobus fulgidus (AF) and Methanococcus jannaschii 
(MJ), as well as the seven bacteria Aquifex aeolicus (AQ), Borrelia burgdor- 
feri (BB), Bacillus subtilis (BS), Escherichia coli (EC), Haemophilus in- 
fluenzae (HI), Mycoplasma genitalium (MG), and Synechocystis sp. (SS). 

The choice of organisms follows Teichmann and Mitchison |17j . while 
the multiple sequence alignments where taken out of the dataset used by 
Cicarelli et al. in [2j. For each of the 28 protein families in the list 

1. Ribosomal proteins, small subunits: S2, S3, S5, S7, S8, S9, Sll, S12, 
S13, S15, S17, LI, L3, L5, L6, Lll, L13, L14, L15, L16, L22; 

2. tRNA-synthetase: Leucyl-, Phenylalanyl-, Seryl-, Valyl; 

3. Other: GTPase, DNA-directed RNA polymerase alpha subunit, Pre- 
protein translocalse subunit SecY, 

a distance matrix was calculated using BELVU [15J with 'scoredist'- 
distance correction (Sonnhammer and Hollich For each distance ma- 

trix, a set of quartet trees was inferred in the classical way by finding, for 
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each four taxon set k, I}, the unrooted quartet (IJ,KL) which maxi- 
mizes the value 

d(I, K) + d{I, L) + d(J, K) + d(J, L) - 2d(I, J) - 2d{K, L), 

where d(—, — ) denotes the respective entry in the distance matrix. 

Algorithm 1141 was then run with the parameter e = 10~ 6 ,e = 10~ 9 , 
and e = 1CP 12 on the quartet distribution obtained from analyzing all the 
28 protein families above, and in a second try on the quartet distributions 
obtained only by the ribosomal proteins. The resulting tree topologies are 
depicted in Figures H] and [SJ respectively (The root of these trees is of course 
not predicted by Algorithm [TH Rather it was placed a posteriori on the 
branch which separates archaea from bacteria on the unrooted output of 
the algorithm.) The different choices of e did not affect the result in these 
calculations. 
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MJ AF MG BB AQ SS BS EC HI 



Figure 4: The neighbor joining species tree topology for the nine prokaryotes, 
using multiple alignments for all 28 protein families. 



MJ AF AQ MG BB SS BS EC HI 



Figure 5: The neighbor joining species tree topology for the nine prokaryotes, 
using only the multiple alignments of the ribosomal proteins above. 
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5.2 Simulations with Mesquite 

As a first test of performance of Algorithm [T3] two series of simulations 
were performed as follows. For a certain choice of a species tree S on a set 
of taxa T, a coalescent process was repeatedly simulated using the coales- 
cence package of Mesquite [TO] , [11] . Each simulation yielded a set of gene 
trees on T, and from these the frequency of each unrooted quartet gene tree 
with leaves in T was determined. These frequencies where then submitted 
to Algorithm [J3] and the resulting (unrooted) tree was compared with the 
(unrooted version of the) species tree S. We report here the proportion of 
correct inferences of the unrooted species tree in the different situations. 

In the first series of simulations the underlying species tree was the 5- 
taxon caterpillar tree (a, (b, (c, (d, e)))), with all internal branch lengths set 
equal (of course, the length of the pending edges does not have an impact, as 
we consider only one lineage per taxon). Algorithm 1141 was run 1000 times 
using 5, 10, 20 and 50 sampled gene trees per trial, 500 times using 100 gene 
trees, 250 times using 200 gene trees, and 100 times using 500 simulated gene 
trees per trial. The proportion of trials which yielded the correct unrooted 
species tree topology is reported in Table [TJ Note that simulations under 
the 5-taxon caterpillar tree are also performed by Liu and Yu [7] in order 
to assess the performance of their 'neighbor joining algorithm for species 
trees'. For our choices of branch lengths and sample sizes, the performance 
of Algorithm [T3] seems to be roughly equal to the performance of 'neighbor 
joining for species trees' (Liu and Yu [7J, Figure 2). 

In a second series of simulations, the underlying species tree was the tree 
inferred by Algorithm [TJ] for the nine prokaryotes in Section I5.1[ see Figure 
[U Again, for different internal branch lengths and different numbers of gene 
trees per trial, 1000 trials (in the case of 5, 10, 20 and 50 gene trees per 
trial), and 500 resp. 250 resp. 100 trials (in the case of 100 resp. 200 resp. 
500 gene trees per trial) were run, and the proportions of correctly inferred 
unrooted species tree topologies are reported in Table [2j 

Two comments are in order: (1) In fact, for each choice of the parameter 
x (and fixed number of gene trees per trial) two simulations were performed. 
For the first, the length of the branch leading to the cherry formed by MJ 
and AF was set to Ax, while in a second simulation this branch length was 
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set to 2x. The differences in the proportions of successful trials are small in 
most cases (between one and two percent), and, as expected, in most cases 
the proportion of successful trials was bigger in the first situation. 

(2) The number of gene trees (or rather, the number of unrooted quar- 
tet gene trees for each 4-taxon subset) used for the reconstruction of the 
prokaryote species tree in Section 15.11 where 21 and 28, respectively. From 
Table [2] we see that such a species tree S is likely to be inferred correctly by 
Algorithm 1141 if its internal branch lengths are around 0.5 (with a probabil- 
ity of more than 90 percent). For branch lengths around 0.2, however, the 
probability for a correct inference of S decreases to about 40 to 50 percent. 
(However, there might still be certain clades on S which can be detected with 
high accuracy also for smaller branch lengths.) Clearly, in these considera- 
tions we ignore effects such as horizontal gene transfer, for whose existence 
there is evidence in the case of the nine prokaryotes considered in Section 
15.11 for some non-ribosomal proteins (see Teichmann and Mitchison |17j). 
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x = internal 
branch length 


5 


10 


20 


50 


100 


200 


500 


0.01 


0.107 


0.100 


0.093 


0.114 


0.124 


0.20 


0.21 


0.05 


0.162 


0.210 


0.230 


0.330 


0.430 


0.67 


0.85 


0.1 


0.285 


0.334 


0.440 


0.662 


0.815 


0.93 


1.00 


0.2 


0.428 


0.580 


0.751 


0.943 


0.990 


1.00 


1.00 


0.5 


0.755 


0.890 


0.980 


1.000 


1.000 






1.0 


0.954 


0.992 


1.000 


1.000 









Table 1: Simulation results for the 5-taxon caterpillar tree (a, (b, (c, (d, e)))) 
with all internal branch lengths set to x (in coalescent units). Table entries 
are proportions of trials which yielded the correct unrooted species tree 
topology. Columns are labelled by the number of simulated gene trees used 
in each trial. 



x = int. 
branch 1. 


5 


10 


20 


25 


50 


100 


200 


500 


a) 0.1 

b) 0.1 


0.005 
0.007 


0.027 
0.018 


0.068 
0.064 


0.102 
0.087 


0.230 
0.218 


0.320 
0.282 


0.33 
0.31 


0.37 
0.27 


a) 0.2 

b) 0.2 


0.046 
0.034 


0.181 
0.157 


0.406 
0.385 


0.498 
0.475 


0.750 
0.735 


0.876 
0.902 


0.97 
0.96 


1.00 
1.00 


a) 0.5 

b) 0.5 


0.325 
0.303 


0.626 
0.607 


0.899 
0.911 


0.966 
0.957 


0.999 
1.000 


1.000 
1.000 






a) 1.0 

b) 1.0 


0.703 
0.708 


0.873 
0.868 


0.966 
0.969 


0.980 
0.983 


0.999 
0.997 


1.000 
1.000 







Table 2: Simulation results for the prokaryote species tree shown in Figure 
|4j All internal branch lengths were set to x (in coalescent units), except for 
the edge leading from the root to (MJ,AF): the length of this edge was set 
to a) 4x and to b) 2x, respectively. Columns are labelled by the number of 
simulated gene trees used in each trial. 
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